ABSTRACT

The purpose of this paper is to identify and discuss
some of the problems presented by the use of computerized adaptive
testing (CAT) in an instructional programs environment versus large
scale testing applications, and to describe an actual implementation
of CAT in an instructional programs setting. This particular
application is in the Electronic Technicians "A" (ETA) school at the
Great Lakes Naval Training Center, Illinois. The goals of
implementing CAT at this site were to increase test security, improve
the efficiency of the testing program, and improve the quality of
measurement yielded by the testing program. The problems encountered
by this CAT program include the unknown dimensionality of the tests,
the small number of available items for the item pools, and the
availability of item response data only for small samples. The
overall design of the project includes four phases: (1) preliminary
analyses and software design; (2) implementation of a
computer-administered conventional test; (3) implementation of a dual
testing program (conventional and CAT); and (4) elimination of the
conventional testing program. If the results are positive, this
project will demonstrate that adaptive testing can effect improvement
in classroom testing. (Author/BW)

The use of computerized adaptive testing (CAT) in an instructional
programs environment presents a number of problems not encountered in large
scale adaptive ability testing applications. Among these are problems due to
the achievement nature of the tests employed. Additional problems arise due
to the small scale and classroom orientation of the instructional programs.
The purpose of this paper is to identify and discuss some of these problems,
and to describe an actual implementation of CAT in an instructional programs
setting. The discussion will begin with a description of the environment in
which the CAT program was implemented, and a discussion of the special
problems encountered. This will be followed by a description of the
approach taken in this particular application. Finally, the implications of
this project and its results will be briefly discussed.

The Implementation

Environment 

This particular application of CAT is in progress at the Great Lakes Naval
Training Center (GLNTC) at Great Lakes, Illinois. More specifically, the
instructional program involved is in the Electronic Technicians "A" (ETA)
school. It involves a six week course on radar that covers three major
areas: primary power distribution, transmitters, and receivers. Each area is
covered by a test given at the end of instruction on that area. Approximately
700 students take the radar course each year, though the exact number varies
from year to year.

Students are separated into classes that vary in size but average about
twenty students. Classes are 'lock-stepped', rather than using
individualized instruction, but not all classes are at the same point in the
program at any one time. That is, at any one time some classes will be on the
power section of the course, some will be covering transmitters, and some will
be studying receivers. To further complicate matters, there are three shifts
per day: a day shift, an evening shift, and a midnight shift. Thus, instruction
and testing continue throughout a given twenty-four hour period. Moreover,
not all classes within a shift are covering the same material.

The conventional test covering primary power distribution had forty
multiple-choice items, each having four choices. The transmitter and receiver
tests each had thirty items. For each test there were two forms, with no
items in common to the two forms. Thus, there were eighty items available for
the power test, and sixty items available for each of the other two tests.

Goals of CAT at the GLNTC 

The overriding concern for the testing program at the GLNTC is security.
For various reasons, test security becomes compromised at a phenomenal rate
at this site, and the ETA school is no exception. Conventional
paper-and-pencil tests become obsolete due to compromised security almost as
soon as they are produced. Because of this, one of the major motivations for
implementing CAT is the improvement of test security.

Another goal of CAT at the GLNTC is the improvement of the efficiency of
the testing program. It is hoped that the implementation of CAT will
significantly reduce the amount of time required for testing, as well as the
amount of time required of the staff for test administration. A relatively
large number of students must pass through the testing program in a relatively
short time, and any improvement in efficiency will be very important.

Another important goal of the CAT program at the GLNTC is the improvement
of the quality of measurement yielded by the testing program. Under the
circumstances prevailing at the GLNTC, decision errors due to poor measurement
can have serious consequences. Very little in the way of resources is
available for remediation, for instance. It is very important, then, that
examinees passed on to the next unit of study actually be competent on the
preceding units. This is especially important when one considers that these
students will eventually graduate and move on to the fleets, where they will be
responsible for maintaining and operating ships' equipment. One would like to
have some confidence that the people graduating from the ETA school have,
indeed, mastered the material taught there.

Special Problems 

While this section addresses directly problems encountered at the GLNTC, 
it is likely that many of these problems are typical of instructional programs 
elsewhere. Most of these problems are inherent to classroom achievement 
testing, rather than being due to any special circumstances unique to the 
GLNTC. 



One of the most serious problems encountered in adaptive achievement
testing centers around the dimensionality of the tests. Achievement tests
tend to be constructed using a table of specifications covering a variety of
topics. Such tests often are highly multidimensional. CAT, on the other
hand, is typically based on models and procedures requiring the assumption of
unidimensionality. The conventional GLNTC tests were based on tables of
specification, so at the outset of the project the dimensionality of the tests
was unknown.

Another problem encountered at the GLNTC stemmed from the fact that the
conventional tests used were relatively short. No resources were available
for a large item development project, so the CAT item pools had to be
constructed from the items available from the conventional tests.
Unfortunately, not many items were available, so the resulting item pools were
rather small.



To further complicate matters, item response data for use in item
calibration were available only for small samples. In large scale testing
programs, data collection for item calibration is relatively simple. In
classroom testing, however, it is difficult and time consuming to amass large
sample sizes. This is made even more difficult by the great haste with which
tests become compromised at the GLNTC.

There are many other problems encountered in adaptive achievement testing
that must be considered when implementing a CAT program in an instructional
programs environment. Among these are questions about the concurrent,
predictive, and content validity of adaptive tests; the stability of
achievement test dimensionality; and the effects of computerized adaptive
administration on item characteristics. All of these must be addressed if CAT
is to be used in instructional programs settings.

The Approach Taken at the GLNTC

Overview 

The design of this project includes four phases. The first phase
involves preliminary analyses to aid in the design of the software, along with
the actual designing of the software. The second phase of the project
involves the implementation of a computer administered conventional test. The
third phase includes the implementation of a dual testing system which
includes both a computerized conventional testing program and a CAT program.
The fourth phase involves elimination of the computerized conventional testing
program and expansion to other areas. The project is currently in the second
phase. Each of these four phases will now be discussed, and the outcomes of
the completed phases will be presented.

Phase I

Three primary tasks were undertaken during the first phase of this
project. The first task was the completion of a study using simulation data
that was designed to compare two different calibration models under conditions
believed to be similar to those which would be encountered at the GLNTC. The
second task was the collection and analysis of response data for the
conventional paper-and-pencil tests for use in selecting a calibration model
to be used in conjunction with item pool building. The third task involved
the design of the test administration software for adaptive testing, as well
as software for a computer administered conventional test. All three of these
tasks were addressed under the constraint that the computer hardware to be
used had already been selected by others.

Task 1. For this task a two-stage study was conducted to compare the
ability estimates yielded by adaptive testing procedures based on the
one-parameter logistic (1PL) and three-parameter logistic (3PL) models. The
first stage of the study employed real response data, while the second stage
employed simulated response data.
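
For reference, the two models compared in this task can be sketched in their standard textbook forms. This is a minimal sketch, not the LOGIST implementation: the scaling constant D = 1.7, the convention of fixing the 1PL discrimination at 1, and the function names are all assumptions made for illustration.

```python
import math

D = 1.7  # conventional logistic scaling constant (an assumption, not stated in the paper)


def p_3pl(theta, a, b, c):
    """Probability of a correct response under the three-parameter logistic model.

    theta: examinee ability; a: discrimination; b: difficulty; c: guessing parameter.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def p_1pl(theta, b):
    """One-parameter logistic model: a single difficulty parameter per item.

    Treated here as the 3PL with discrimination fixed at 1 and no guessing.
    """
    return p_3pl(theta, 1.0, b, 0.0)
```

Under these forms, an examinee whose ability equals the item difficulty has a 0.5 chance of success under the 1PL, while under the 3PL the probability never falls below the guessing parameter c.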

In the first stage, response data for 3000 examinees were obtained for
the forty item ACT Assessment Mathematics Usage subtest (The American College
Testing Program, 1982). The first 2000 cases were used to obtain item
parameter estimates for both the 1PL and the 3PL models using the LOGIST
computer program (Wingersky, Barton, and Lord, 1982). Using these estimates,
1PL and 3PL adaptive tests were simulated using the response data for the
remaining 1000 cases. Both adaptive testing procedures employed maximum
likelihood ability estimation and maximum information item selection
procedures. The two sets of ability estimates yielded by the two adaptive
testing procedures were then compared.
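
Maximum information item selection, as named above, can be illustrated with a short sketch. It uses the standard 3PL item information function and picks the unused item with the largest information at the current ability estimate. The pool layout (a list of (a, b, c) tuples), the constant D = 1.7, and the function names are assumptions for illustration only.

```python
import math

D = 1.7  # logistic scaling constant (assumed)


def prob(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c):
    """Standard 3PL item information at ability theta (reduces to D^2*P*Q for the 1PL)."""
    p = prob(theta, a, b, c)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2


def select_item(theta, pool, used):
    """Return the index of the unused item with maximum information, or None."""
    candidates = [i for i in range(len(pool)) if i not in used]
    if not candidates:
        return None
    return max(candidates, key=lambda i: item_information(theta, *pool[i]))
```

With a hypothetical pool of three 1PL-style items at difficulties -1, 0, and 1, an examinee estimated at 0.1 would first receive the item at difficulty 0, since information peaks where difficulty matches ability.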

In the second stage, response data for 3000 cases were generated
according to the 3PL model using as true parameters the 3PL item parameter
estimates from the first stage. True abilities were selected from the
standard normal distribution. The first 2000 cases were used for 1PL and 3PL
calibrations of the items, and the remaining 1000 cases were used to simulate
1PL and 3PL adaptive tests. The two sets of ability estimates yielded by the
two adaptive testing procedures were compared to each other and to the true
ability parameters.
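
The data generation step described above can be sketched as follows. This is an illustrative sketch only: the item parameters in the usage note are hypothetical placeholders rather than the estimates used in the study, and D = 1.7 and the function name are assumptions.

```python
import math
import random

D = 1.7  # logistic scaling constant (assumed)


def simulate_responses(item_params, n_examinees, seed=0):
    """Generate 0/1 responses under the 3PL model.

    item_params: list of (a, b, c) tuples treated as true parameters.
    True abilities are drawn from the standard normal distribution.
    Returns (abilities, response_matrix).
    """
    rng = random.Random(seed)
    abilities, data = [], []
    for _ in range(n_examinees):
        theta = rng.gauss(0.0, 1.0)
        row = []
        for a, b, c in item_params:
            p = c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
            row.append(1 if rng.random() < p else 0)
        abilities.append(theta)
        data.append(row)
    return abilities, data
```

Following the design in the text, one would generate 3000 cases, use `data[:2000]` for calibration, and keep `data[2000:]` together with `abilities[2000:]` for simulating the adaptive tests and comparing estimates with the true values.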

Results of this study are reported in detail in McKinley and Reckase
(1983). They are summarized in Table 1, which shows the intercorrelations of
the ability estimates for the real data, and Table 2, which shows the
intercorrelations for the simulation data. In general, the results of both
stages of the study indicated that the 1PL and 3PL adaptive tests yielded very
highly correlated ability estimates, and that there was no apparent advantage,
in terms of ability estimation, to using one of the models over the other.
This was attributed to the fact that, due to the small size of the item pool,
both procedures administered a large proportion of the item pool to each
examinee. Thus, the two procedures administered much the same set of items to
each examinee, and therefore yielded much the same ability estimate for each
examinee.

Task 2. The second task of Phase I involved the collection and analysis
of response data for the items in the item pool using the conventional
paper-and-pencil test forms. Analyses performed on these data include principal
components analyses, item analyses, and calibrations for the 1PL and 3PL
models. The goal of these analyses was the evaluation of the appropriateness
of the 1PL and 3PL models (or any unidimensional item response theory model)
for use with these data. Data were available for approximately 400-500
examinees.
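
The classical item analysis statistics mentioned above can be computed as in the following sketch. It assumes a complete 0/1 response matrix (one row per examinee) and uses the uncorrected point biserial, in which the item is included in the total score; whether the original analyses applied a correction is not stated, so that choice is an assumption.

```python
import math


def item_analysis(data):
    """Return a list of (proportion_correct, point_biserial) pairs, one per item.

    data: list of rows of 0/1 item scores. The point biserial is the
    correlation between the item score and the total test score.
    """
    n = len(data)
    n_items = len(data[0])
    totals = [sum(row) for row in data]
    mean_total = sum(totals) / n
    sd_total = math.sqrt(sum((t - mean_total) ** 2 for t in totals) / n)
    results = []
    for j in range(n_items):
        n_correct = sum(row[j] for row in data)
        p = n_correct / n
        if n_correct in (0, n) or sd_total == 0.0:
            # No variance in the item or in the totals: correlation undefined.
            results.append((p, float("nan")))
            continue
        mean_correct = sum(t for row, t in zip(data, totals) if row[j] == 1) / n_correct
        r_pb = (mean_correct - mean_total) / sd_total * math.sqrt(p / (1.0 - p))
        results.append((p, r_pb))
    return results
```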

Table 3 shows the item analysis (proportion-correct difficulties and
point biserial discriminations) and IRT calibration results for the
transmitter item pool. These data are similar to those obtained for the other
pools. The results of the item analyses and the item response theory (IRT)
model calibrations shown in Table 3 indicated that the items were all quite
easy. Proportion-correct scores below 0.5 were rare. Because of this,
considerable difficulty was encountered in the estimation of the guessing
parameter. The LOGIST program tended to set the guessing value for most of
the items equal to a constant value. This would seem to imply that a model
with a constant guessing factor could be used with these data.

It was also discovered that the item discrimination values varied
considerably. In the study described under the preceding section, item
discriminations were uniformly high. Due to this and the smallness of the
item pool, the 1PL and 3PL adaptive testing procedures yielded similar
results. For these data, the item pool was small, but items varied in
discrimination. Therefore, it was unclear to what extent the above study
would generalize to these data.

In order to investigate this, another simulation study was conducted.
The sixty 3PL item parameter estimates obtained for the items on the
transmitter test were used as true parameters. Using these, the simulation
data design employed under Task 1 above was again applied. Information cutoffs
for the two procedures were selected to yield tests of roughly the same
average length. Again, the 1PL and 3PL adaptive test ability estimates were
compared to each other and to the known true abilities.

Table 4 shows the intercorrelation matrix for the 1PL and 3PL adaptive
test ability estimates and the true abilities. As can be seen, the 3PL
estimates had a slightly higher correlation with the true values than did the
1PL CAT estimates. Still, the 1PL and 3PL CAT estimates were highly
correlated. The 1PL adaptive tests had an average test length of fifteen,
while the average test length for the 3PL tests was thirteen. These results
support the conclusion that the little there is to gain from use of the more
complex 3PL procedure is probably not worth the added expense. Bear in mind
that what advantages there are to the 3PL model come only with dramatically
increased sample sizes, which in many cases might be impractical or impossible
to obtain.

The results of the principal components analyses indicate that, while
these tests are not truly unidimensional, there does tend to be a dominant
first factor. The other factors present do not lend themselves to
interpretation. They do not appear to be associated with content or item
type, and are therefore probably not important.
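
One simple way to gauge whether a first component dominates is to compare the largest eigenvalue of the inter-item correlation matrix with the number of items. The sketch below does this with power iteration. It is an illustration under stated assumptions: Pearson (phi) correlations on 0/1 data are used, which may differ from the coefficients used in the original analyses, every item is assumed to have nonzero variance, and the function names are hypothetical.

```python
import math


def correlation_matrix(data):
    """Pearson correlations between item columns of a 0/1 response matrix.

    Assumes every item has some variance (no item answered identically by all).
    """
    n = len(data)
    k = len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(k)]
    sds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in data) / n) for j in range(k)]
    corr = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            cov = sum((row[i] - means[i]) * (row[j] - means[j]) for row in data) / n
            corr[i][j] = cov / (sds[i] * sds[j])
    return corr


def first_eigenvalue(matrix, iterations=500):
    """Largest eigenvalue of a symmetric positive semidefinite matrix by power iteration."""
    k = len(matrix)
    v = [1.0 / math.sqrt(k)] * k
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
        eigenvalue = norm
    return eigenvalue
```

Dividing the first eigenvalue by the number of items gives the proportion of total variance accounted for by the first principal component; a value well above that of the remaining components would be consistent with a dominant first factor.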

Task 3. Based on the results of the first two tasks of this phase, the
decision was made to base the adaptive testing system on the 1PL model. The
procedure developed employs maximum likelihood ability estimation and maximum
information item selection. Testing is terminated when no items remain unused
that yield an item information value for the most recent estimate of ability
greater than a specified minimum, or when twenty items have been
administered. The examinee's ability estimate is increased by a fixed
stepsize for a correct response and decreased by a fixed stepsize for an
incorrect response until both a correct and an incorrect response have been
obtained. Initial estimates of ability were selected so as to represent
difficulty values near the mode of the item pool information function. The
actual values for these parameters will not be determined until the completion
of Phase II is near.
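
The procedure described above can be sketched as follows. This is not the software developed for the project. It is a minimal illustration under stated assumptions: the starting ability, stepsize, and information minimum are placeholder values (the text notes the actual values had not yet been set), D = 1.7 is assumed, and the `answer` callback standing in for the examinee is hypothetical.

```python
import math

D = 1.7  # logistic scaling constant (assumed)


def p_1pl(theta, b):
    """1PL probability of a correct response to an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-D * (theta - b)))


def info_1pl(theta, b):
    """1PL item information at ability theta."""
    p = p_1pl(theta, b)
    return D * D * p * (1.0 - p)


def mle_1pl(difficulties, responses, theta, iterations=25):
    """Newton-Raphson maximum likelihood ability estimate (needs mixed responses)."""
    for _ in range(iterations):
        ps = [p_1pl(theta, b) for b in difficulties]
        gradient = D * sum(u - p for u, p in zip(responses, ps))
        curvature = D * D * sum(p * (1.0 - p) for p in ps)
        step = gradient / curvature
        theta += max(-1.0, min(1.0, step))  # damp very large steps
        if abs(step) < 1e-6:
            break
    return theta


def run_cat(pool, answer, start=0.0, stepsize=0.5, min_info=0.3, max_items=20):
    """Administer a 1PL adaptive test; pool is a list of item difficulties.

    answer(i) returns 1 (correct) or 0 (incorrect) for item index i.
    Returns (final ability estimate, list of administered item indices).
    """
    theta, used, bs, us = start, [], [], []
    while len(used) < max_items:
        remaining = [i for i in range(len(pool)) if i not in used]
        if not remaining:
            break
        best = max(remaining, key=lambda i: info_1pl(theta, pool[i]))
        if info_1pl(theta, pool[best]) <= min_info:
            break  # no unused item exceeds the information minimum
        u = answer(best)
        used.append(best)
        bs.append(pool[best])
        us.append(u)
        if 0 < sum(us) < len(us):
            theta = mle_1pl(bs, us, theta)  # mixed pattern: use maximum likelihood
        else:
            theta += stepsize if u == 1 else -stepsize  # fixed step until mixed
    return theta, used
```

The fixed stepsize rule is needed because the maximum likelihood estimate is not finite while all responses are correct or all are incorrect.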

In addition to the CAT software, a computerized conventional test
administration program was produced during Phase I. This program administers
to an examinee the same set of items as appeared on the paper-and-pencil form
of the test. Items are administered in a randomized order for test security
purposes.
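
Randomizing the order of a fixed form can be sketched in a few lines. The function name and the use of Python's standard shuffle are assumptions for illustration; the original program's method of randomization is not described.

```python
import random


def administration_order(item_ids, rng=None):
    """Return the items of a fixed test form in a freshly randomized order."""
    rng = rng or random.Random()
    order = list(item_ids)
    rng.shuffle(order)
    return order
```

Each examinee still receives every item on the form; only the sequence differs from one administration to the next.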

Phase II

Three primary tasks are included in Phase II of this project. The first
task is the implementation of a computerized conventional testing program.
The second task is the collection and analysis of response data from the
computerized conventional tests for the purpose of updating the calibration
results for the CAT item pools. The third task involves research directed at
the investigation of the effects of computer administration on item
characteristics.

Task 1. Initiation of the computerized conventional testing program
occurred in late February. The program was implemented simultaneously for the
three areas: primary power distribution, transmitters, and receivers. As was
indicated previously, in this program items are selected and administered in a
randomized order.

The purposes of this program are twofold. First, the program is
necessary for obtaining additional response data for the item pool
calibrations that are not contaminated due to compromised test security.
Second, this program will yield data useful for assessing the effects of
computer administration on item characteristics, particularly item
difficulty. To date, insufficient data have been collected for meaningful
analysis.

Task 2. The second task of Phase II will include item analyses, IRT
analyses, and factor analyses of both the paper-and-pencil data and the
computerized testing data. The purpose of these analyses is to assess the
effects of computer administration on item difficulty, item discrimination,
and the dimensionality of the item pools. This task will commence once
sufficient data have been collected from the computerized conventional testing
program.

Task 3. The nature of the third task of Phase II will depend on the
results of the analyses of the data collected from the computerized
conventional testing program. Once these data have been analyzed, it will be
determined whether or not these new data can be combined with the old in order
to obtain new item pool calibrations. If the two sets of data cannot be
combined, adaptive testing will commence when sufficient data for calibration
of the item pools have been obtained from the computerized conventional
testing program.

Phase III



The primary tasks of Phase III include the initiation of the CAT program,
and the evaluation of the validity of the CAT program. During this phase, the
CAT and computerized conventional testing programs will be run concurrently.
Each examinee will be administered both. The purpose of this is the
collection of data useful for a direct comparison of the CAT program to the
conventional testing program. Similarities in the results of the two types of
test will be considered to be evidence in support of the validity of the CAT
program.
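
One straightforward index of such similarity is the correlation between each examinee's CAT ability estimate and conventional test score, in the same spirit as the intercorrelations reported in Tables 1, 2, and 4. The sketch below computes a Pearson correlation; treating this as the comparison statistic is an assumption, since the planned validity analyses are not specified in detail.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

Here `x` would hold the CAT ability estimates and `y` the conventional test scores for the same examinees, in the same order.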

Phase IV 

The fourth phase of the project includes two main objectives. First, once
sufficient evidence for the validity of the CAT program has been collected,
the computerized conventional testing program will be eliminated. Also, at
this point work will commence on the expansion of the project to include other
courses in the ETA school, and perhaps to other schools.



Once the CAT program has replaced the conventional testing program,
other, more long-term research projects will be undertaken. Among these is
the investigation of the stability of the item pool dimensionality (and
calibration results) over time. Also, research will be conducted on
procedures for the calibration of new items for the CAT item pools.

Implications

This project is important far beyond any value assigned to the research
results, which will be quite important in themselves. This added significance
derives from the nature of the project itself: the application of adaptive
testing in the classroom. If the results of this project are positive, it
will demonstrate that adaptive testing can effect improvement in an area of
great significance.

Table 1

Intercorrelation Matrix for Ability Parameter
Estimates for the Real Data

Ability             Adaptive Tests     Paper-and-Pencil Tests
Estimate            1PL      3PL       1PL      3PL

Adaptive   1PL      1.00     0.77      0.8?     0.87
           3PL               1.00      0.81     0.86
P & P      1PL                         1.00     0.95
           3PL                                  1.00



Table 2

Intercorrelation Matrix for True and Estimated Abilities
for the Simulation Data

Ability             True     Adaptive Tests     Paper-and-Pencil Tests
Estimate                     1PL      3PL       1PL      3PL

True                1.00     0.88     0.82      0.90     0.8?
Adaptive   1PL               1.00     0.81      0.93     0.92
           3PL                        1.00      0.83     0.85
P & P      1PL                                  1.00     0.93
           3PL                                           1.00

Table 3 (Continued)

          0.78     0.12     -1.64     0.40     -1.40     0.21
  52      0.87     0.14     -2.39     0.53     -1.93     0.21
  53      0.85     0.14     -2.20     0.56     -1.65     0.21
  54      0.96     0.03     -4.0?     0.35     -5.17     0.21
  55      0.78     0.11     -1.61     0.39     -1.36     0.21
  56      0.99     0.13     -6.01     1.28     -2.99     0.21
  57      0.89     0.11     -2.65     0.49     -2.33     0.21
  58      0.24     0.15      1.75     0.46     25.37     0.21
  59      0.85     0.17     -2.20     0.58     -1.62     0.21
  60      0.89     0.23     -2.62     0.86     -1.58     0.21

Table 4

Intercorrelation Matrix for True and Estimated Abilities
for Simulated Data for the Transmitter Item Pool

Ability             True     Adaptive
Estimate                     1PL      3PL

True                1.00     0.71     0.78
Adaptive   1PL               1.00     0.77
           3PL                        1.00



Implementing an Adaptive Testing Program
in an Instructional Programs Environment

Abstract

The use of computerized adaptive testing (CAT) in an instructional
programs environment presents a number of problems not encountered in large
scale adaptive ability testing applications. Among these are problems due to
the achievement nature of the tests employed. Additional problems arise due
to the small scale and classroom orientation of the instructional programs.
In this paper, some of these problems are identified and discussed. In
addition, an actual implementation of CAT in an instructional programs setting
is described, and the special problems encountered in that implementation are
discussed.